 online policy selection



Online Adaptive Policy Selection in Time-Varying Systems: No-Regret via Contractive Perturbations

Neural Information Processing Systems

We study online adaptive policy selection in systems with time-varying costs and dynamics. We develop the Gradient-based Adaptive Policy Selection (GAPS) algorithm together with a general analytical framework for online policy selection via online optimization. Under our proposed notion of contractive policy classes, we show that GAPS approximates the behavior of an ideal online gradient descent algorithm on the policy parameters while requiring less information and computation. When convexity holds, our algorithm is the first to achieve optimal policy regret. When convexity does not hold, we provide the first local regret bound for online policy selection. Our numerical experiments show that GAPS can adapt to changing environments more quickly than existing benchmarks.
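The "ideal online gradient descent on the policy parameters" that the abstract uses as a reference point can be illustrated on a toy problem. The sketch below is not GAPS and is not taken from the paper. It assumes a hypothetical scalar system with drifting dynamics, a linear feedback policy with a single gain `theta`, and a quadratic cost. The function name, constants, and step size are all illustrative assumptions. The gain is projected onto an interval chosen so that every policy in the class keeps the closed loop contractive.

```python
import math
import random


def run_ogd_policy_selection(T=500, eta=0.05, theta0=0.4, seed=0):
    """Online gradient descent on a scalar feedback gain theta.

    Hypothetical toy setup (an assumption, not the paper's experiment):
      dynamics  x_{t+1} = a_t * x_t + u_t + w_t, with a_t drifting in [0.9, 1.1]
      policy    u_t = -theta_t * x_t
      cost      c_t = x_t**2 + 0.1 * u_t**2
    theta is projected onto [0.4, 1.6], so |a_t - theta| <= 0.7 < 1 and
    every policy in the class is stabilizing.
    """
    rng = random.Random(seed)
    lo, hi = 0.4, 1.6
    theta, x = theta0, 1.0
    s = 0.0  # running estimate of d x_t / d theta
    thetas, costs = [], []
    for t in range(T):
        a_t = 1.0 + 0.1 * math.sin(2 * math.pi * t / 100)  # time-varying dynamics
        w_t = rng.uniform(-0.1, 0.1)                       # bounded disturbance
        u = -theta * x
        thetas.append(theta)
        costs.append(x * x + 0.1 * u * u)
        # Chain rule: d u_t / d theta, including the dependence through x_t.
        du = -x - theta * s
        # Gradient of the stage cost with respect to theta.
        grad = 2 * x * s + 0.2 * u * du
        # Propagate the state and its sensitivity to the next step.
        s = a_t * s + du
        x = a_t * x + u + w_t
        # Projected gradient step keeps theta inside the contractive class.
        theta = min(hi, max(lo, theta - eta * grad))
    return thetas, costs


if __name__ == "__main__":
    thetas, costs = run_ogd_policy_selection()
    print("final gain:", thetas[-1])
    print("average cost:", sum(costs) / len(costs))
```

Because the closed-loop factor stays below 1 in magnitude, the influence of old parameter choices on the current state decays geometrically. That decay is the intuition behind approximating the full gradient with less information, although the sketch itself simply carries an exact sensitivity recursion.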




51200d29d1fc15f5a71c1dab4bb54f7c-AuthorFeedback.pdf

Neural Information Processing Systems

We would like to thank our reviewers for their thoughtful comments and feedback. However, to preserve anonymity, we cannot share the link to the repository. Our most challenging tasks are locomotion tasks, which are not well suited for human demonstrations. But we believe this is an important direction for research as well. We will add this rationale to the paper.



Supplementary material A Detailed description of baselines A.1 Continuous Baselines

Neural Information Processing Systems

Multivariate Gaussian distributions, which is used as the final policy output. Humanoid experiments, the data consists of very diverse ways of running). In Table 5, we show the hyperparameters shared among our baselines. Distributed Distributional Deep Deterministic Policy Gradient [Barth-Maron et al., 2018]: We used batch size 1024 for the experiments. Behavior Regularized Actor Critic [Wu et al., 2019] is an actor-critic algorithm where the ... We use the exact same network architecture as described in the original paper.



